Skip to content

Python: fix(python): unblock connect() when ClientSession or session.initialize() fails#14016

Open
nileshpatil6 wants to merge 2 commits into
microsoft:mainfrom
nileshpatil6:fix/mcp-connect-hangs-on-session-init-failure
Open

Python: fix(python): unblock connect() when ClientSession or session.initialize() fails#14016
nileshpatil6 wants to merge 2 commits into
microsoft:mainfrom
nileshpatil6:fix/mcp-connect-hangs-on-session-init-failure

Conversation

@nileshpatil6
Copy link
Copy Markdown

Description

Fixes #13414

The bug

MCPPluginBase.connect() spawns _inner_connect() as a background task and then waits on a ready_event:

self._current_task = asyncio.create_task(self._inner_connect(ready_event))
await ready_event.wait()

_inner_connect() has three failure points inside the if not self.session: block:

Failure ready_event.set() before raise?
Transport (get_mcp_client()) YES (correct)
ClientSession(...) creation NO (bug)
session.initialize() NO (bug)

When the MCP server returns HTTP 401/403, session.initialize() raises an exception. The handler closes the exit stack and re-raises KernelPluginInvalidConfigurationError, but never calls ready_event.set(). Because the exception is trapped inside the background task, connect() keeps waiting on ready_event indefinitely, hanging the caller until an external timeout fires (e.g. the 30-second PromptFlow deadline).

The same hang applies when ClientSession construction itself fails.

The fix

Two changes:

  1. _inner_connect: add ready_event.set() before raise in both the ClientSession and session.initialize() exception handlers, making them consistent with the already-correct transport-failure branch.

  2. connect: after await ready_event.wait(), check whether the background task has already finished with an exception and re-raise it. This ensures the error surfaces to the caller (including __aenter__) instead of silently succeeding with a broken connection state.

# connect() - after ready_event fires on an error path the task is done
if self._current_task.done():
    exc = self._current_task.exception()
    if exc is not None:
        raise exc

Testing

The existing test_mcp_plugin_failed_get_session test covers the transport-failure path. The two new paths (ClientSession failure and session.initialize() failure) can be verified with the same pattern used in that test by patching ClientSession.__aenter__ or ClientSession.initialize to raise.

Contribution Checklist

  • The code builds clean without any errors or warnings
  • The PR follows the SK Contribution Guidelines
  • All unit tests pass
  • I didn't break anyone

…ze() fails

When _inner_connect() failed to create a ClientSession or to call
session.initialize(), it raised KernelPluginInvalidConfigurationError
without first calling ready_event.set().  The connect() method was
waiting on that event in a background task, so it would hang
indefinitely (until an external timeout fired, e.g. 30 s in
PromptFlow).

The transport-failure branch already called ready_event.set() before
raising, so the pattern was consistent there but missing in the two
later branches.

Fix:
- Add ready_event.set() before raise in the ClientSession creation
  exception handler.
- Add ready_event.set() before raise in the session.initialize()
  exception handler.
- After await ready_event.wait() in connect(), check whether the
  background task finished with an exception and re-raise it so that
  callers (including __aenter__) receive the error instead of silently
  succeeding with a broken state.

Fixes microsoft#13414
Copilot AI review requested due to automatic review settings May 16, 2026 14:42
@nileshpatil6 nileshpatil6 requested a review from a team as a code owner May 16, 2026 14:42
@moonbox3 moonbox3 added the python Pull requests for the Python Semantic Kernel label May 16, 2026
@github-actions github-actions Bot changed the title fix(python): unblock connect() when ClientSession or session.initialize() fails Python: fix(python): unblock connect() when ClientSession or session.initialize() fails May 16, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Python MCP connector to address connection hangs when background connection setup fails, while also introducing sampling consent and MCP tool/prompt name conflict handling.

Changes:

  • Propagates background _inner_connect() failures after ready_event is set.
  • Adds sampling consent callback support across MCP plugin constructors.
  • Skips MCP tools/prompts whose normalized names conflict with existing plugin attributes.
Comments suppressed due to low confidence (1)

python/semantic_kernel/connectors/mcp.py:355

  • ready_event is set only after awaiting _exit_stack.aclose(). If cleanup raises while handling a failed session.initialize(), the background task exits without signaling the event and the caller still hangs in connect(). The event should be signaled reliably even when cleanup fails.
                await self._exit_stack.aclose()
                ready_event.set()

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +296 to +303
# If the background task finished before (or exactly when) ready_event was
# set, it means it raised an exception on an error path. Re-raise it here
# so callers always receive the error rather than silently succeeding with a
# broken connection state.
if self._current_task.done():
exc = self._current_task.exception()
if exc is not None:
raise exc
Comment on lines 346 to +347
await self._exit_stack.aclose()
ready_event.set()
session: ClientSession | None = None,
kernel: Kernel | None = None,
request_timeout: int | None = None,
sampling_consent_callback: SamplingConsentCallback | None = None,
Comment on lines +300 to +303
if self._current_task.done():
exc = self._current_task.exception()
if exc is not None:
raise exc
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Automated Code Review

Reviewers: 4 | Confidence: 94%

✓ Correctness

The core bugfix is correct: adding ready_event.set() before raise in the two missing exception handlers in _inner_connect (ClientSession creation and session.initialize()) makes them consistent with the already-correct transport-failure branch, preventing connect() from hanging indefinitely. The post-wait check if self._current_task.done() in connect() correctly re-raises the exception to the caller, since in CPython's asyncio the task completes (via the raise) before the event loop resumes the waiter scheduled by ready_event.set(). The sampling consent callback and name-conflict detection additions are straightforward and correctly implemented. No correctness issues found.

✓ Security Reliability

The PR correctly fixes the connect() hang bug by adding ready_event.set() before raise in the two previously-missing exception handlers and re-raising task exceptions in connect() after the event fires. The new sampling consent callback follows a fail-closed design (denies on exception), and the name-conflict guard prevents a malicious MCP server from overwriting critical plugin attributes via setattr. No security or reliability issues found in the changed code.

✗ Test Coverage

The PR fixes a real hang bug by adding ready_event.set() in two exception handlers and re-raising background-task exceptions in connect(). New features (sampling consent callback, name conflict detection) have good test coverage. However, the core bug fix itself — the two new ready_event.set() calls and the error re-raise in connect() — has NO dedicated test. The existing test_mcp_plugin_failed_get_session only covers the transport-failure path that was already correct. The PR description explicitly identifies ClientSession-creation and session.initialize() failure as testable paths but no tests were added for them.

✗ Design Approach

The new conflict guard only snapshots dir(self) once, so it prevents collisions with built-in plugin attributes but still silently allows collisions between MCP functions loaded in different passes. A tool and prompt that normalize to the same local name can still overwrite each other, which means the change does not fully solve the name-shadowing problem it introduces tests for.

Flagged Issues

  • The primary bug fix (ClientSession creation failure and session.initialize() failure no longer hanging connect()) has no dedicated test coverage. The PR description acknowledges these paths are testable by patching ClientSession.__aenter__ or ClientSession.initialize to raise (following the existing test_mcp_plugin_failed_get_session pattern), but no such tests were added, leaving no regression guard for the exact hang scenario this PR fixes.
  • mcp.py:513-515 caches reserved names from dir(self) only once before loading begins, but MCP functions are later installed via setattr(self, local_name, func) at lines 538 and 553. A tool and prompt that normalize to the same local_name (e.g. both become safe_tool) will not trigger the conflict guard on the second pass; the later load silently overwrites the earlier one, and KernelPlugin.from_object (kernel_plugin.py:239-247) exports only the surviving attribute, silently dropping a remote capability.

Automated review by nileshpatil6's agents

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

python Pull requests for the Python Semantic Kernel

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python: Bug: MCPStreamableHttpPlugin.__aenter__() method doesn't treat HTTP 401/403 as fatal errors

3 participants